High Productivity MPI is an approach to extending MPI to support multiple

•  Implementations  (IMPI,  IMPI-2) 

•  Owner  domains 

•  Architectures 

•  Networks 

•  Operating  Systems 

•  Faults 

•  Interacting  dynamic  groups 

without  relying  on  a  two-level  implementation  (one  MPI  implementation  calling  another).  MPI 
implementations  must  be  able  to  connect,  reconnect,  and  work  well  with  dynamic,  intermittent 
resources,  under  the  expectation  that  user  applications  will  also  become  somewhat  fault-aware  in 
order  to  retain  scalability. 


This paper addresses the many concerns that arise in offering composable sessions in which
MPIs from multiple vendors can be supported (starting from, but not ending with, the IMPI protocol).
Experiences with IMPI, and a new proposal, IMPI-2, are offered. The paper also addresses specific
issues in interoperating the full gamut of MPI-2 services, which to our knowledge have not been
addressed elsewhere.

The results of this work are open specifications, together with our own vendor implementation of
these MPI capabilities. Other open and commercial MPIs could adopt IMPI plus these
extensions in order to participate in hierarchical, heterogeneous, grid computing settings,
without mandating new MPI implementations in such settings. The authors' goal in offering
these new protocols as proposals is to encourage the High Productivity Computing world to enter
into significant discussions about their adoption. A further goal is to offer these capabilities
without mandating grid-computing infrastructure.


Specific requirements for supporting several overlapping and non-overlapping networks of
varying performance (overhead, bandwidth, latency, concurrency) are discussed in terms of
progressive MPI implementations and the joint progressive nature of compliant MPIs working
together within the IMPI-2 protocol framework. Note that existing multi-cluster/grid solutions
apparently cause excessive polling, and do not support the degree of scalability or the
intra-network performance that correctly composed MPI implementations would otherwise achieve.


In the spirit of pursuing practical fault tolerance in this setting, the extension of
checkpoint-restart facilities to High Productivity MPI is considered both from the perspective
of MPI I/O on single implementations, and in terms of MPI I/O as extended to multiple
implementations connected through an IMPI-2 protocol framework.


Other  aspects  of  fault-tolerant  MPI  are  beyond  the  scope  of  this  paper. 

The authors note that there are many thorny issues associated with supporting all the features of
MPI-2 across multiple vendor implementations, and several of these issues are highlighted. A
“core MPI-2” (a subset of MPI-1 plus a subset of MPI-2), together with some additional dynamic
processing extensions, is suggested as a best practice for computing in this setting.


Networks  of  grids  (or  multi-clustering)  represent  an  interesting  capability  but  also  a  challenge  in 
terms  of  the  publication  of  the  entire  structure  (including  IP  addresses)  globally.  This  work  also 
considers  the  use  of  techniques  such  as  NAT  and  port  forwarding,  together  with  gateway  nodes, 
in  order  to  allow  for  structured,  manageable  descriptions  of  hierarchical  parallel  resources,  so 
that  appropriate  communication  bandwidth  remains  possible  between  clusters,  without 
mandating  a  public  model  for  all  resources  involved.  This  has  been  accomplished  with  IMPI  as 
is,  and  extensions  in  the  IMPI-2  framework  are  discussed. 


The paper also tries to address some of the drawbacks of existing IMPI protocols, including:

•  Involvement  of  user  in  starting  MPI  jobs  distributed  across  multiple  platforms 

•  Global  collective  operations  mandated  by  standard 

•  Non-portable  parallel  job  startup  mechanism 


Figure 1: Connectivity Architecture of a multi-cluster, multi-implementation MPI


Figure 1 shows a typical scenario in which multiple clusters (possibly under different owner
domains and with different internal networks) run multiple implementations of MPI. Our approach
enables these implementations to work together seamlessly without requiring a new layered
implementation, and also enables individual implementations to use proprietary and optimized
communication protocols with no increased overhead. The IMPI-1 implementation, which is
currently available as part of MPI/Pro, was supported by NIST through Contract # 50-DKNB-1-SB082.
IMPI


•  IMPI  -  Interoperable  Message  Passing  Interface 

•  Developed  and  Proposed  by  NIST 

•  Standard  for  inter-operation  of  multiple 

-  Implementations  (IMPI,  IMPI-2) 

-  Architectures 

-  Networks 

-  Operating  Systems 
Client, Host Concept


•  MPI  processes  spread  across  multiple  clients 

•  Clients  represent  MPI  processes  belonging  to  a  single 
implementation 

•  Hosts  represent  gateways  for  processes  of  Clients 

•  IMPI  Application  may  have  two  or  more  clients 

•  Client  may  have  one  or  more  hosts 

•  Hosts  serve  as  gateways  for  one  or  more  MPI  processes 
Typical Scenario - Multi-vendor MPI


•  3 Clients (each cluster makes one client)

-  Client 1: 2 hosts, 2 MPI processes

-  Client 2: 1 host, 3 MPI processes

-  Client 3: 2 hosts, 6 MPI processes
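
The client/host layout of this scenario can be written down as a small data model. The sketch below is illustrative only: the class name `Client`, the helper names, and in particular the choice to number global ranks contiguously client by client are assumptions made for this example, not details taken from the slides. The validation rules mirror the Client, Host Concept slide (two or more clients, one or more hosts per client, each host a gateway for at least one process).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Client:
    """One IMPI client: the processes of a single MPI implementation."""
    name: str
    hosts: int       # gateway hosts for this client
    processes: int   # MPI processes behind those hosts


# The three-cluster scenario from the slide.
SCENARIO = [
    Client("Client 1", hosts=2, processes=2),
    Client("Client 2", hosts=1, processes=3),
    Client("Client 3", hosts=2, processes=6),
]


def validate(clients):
    """Check the structural rules stated on the Client, Host Concept slide."""
    if len(clients) < 2:
        raise ValueError("an IMPI application needs two or more clients")
    for c in clients:
        if c.hosts < 1:
            raise ValueError(f"{c.name}: a client needs one or more hosts")
        if c.processes < c.hosts:
            raise ValueError(f"{c.name}: every host must serve at least one process")


def global_rank_ranges(clients):
    """Assign contiguous global ranks client by client (assumed ordering)."""
    ranges = {}
    next_rank = 0
    for c in clients:
        ranges[c.name] = range(next_rank, next_rank + c.processes)
        next_rank += c.processes
    return ranges


validate(SCENARIO)
TOTAL_PROCESSES = sum(c.processes for c in SCENARIO)
TOTAL_HOSTS = sum(c.hosts for c in SCENARIO)
```

Under the assumed ordering, the scenario has 11 MPI processes behind 5 hosts, and the processes of Client 3 occupy global ranks 5 through 10.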


MPI/Pro  1.7.0 


•  MPI/Pro 1.7.0 provides the first complete implementation of
IMPI

•  Enables  Interoperation  between 

-  Windows, Linux and Mac OS X operating systems

-  32-bit  and  64-bit  architectures 

-  TCP,  GM  and  VAPI  Networks 

-  Any  combination  of  all  the  above 
Extensions


•  IMPI  does  not  address  issues  such  as 

-  Private  IP  Addresses 

-  Owner  domains 

-  Faults 

-  Interacting  Dynamic  Groups 

•  The above issues play a vital role in Grid computing

•  Verari proposed and implemented a novel method to
address the issue of Private IP Addresses
Case Study


Private  IP  Enabled  IMPI 


Typical Cluster Setup


•  Compute nodes have private IP addresses

•  External communication through a single head node or gateway

•  Unsuitable for multi-cluster, multi-site communication
Network Address Translation (NAT)


•  Firewalls block incoming connections

•  NAT is used to serve requests generated within the cluster




NAT-based IMPI


•  Use NAT to generate dynamic mappings between the head node and the compute nodes

•  Disseminate the dynamic mapping information through the IMPI server

•  Release the mapped ports on the head node on completion of the application
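
The three steps above (map, disseminate, release) can be sketched as a port-mapping table kept on the head node. This is a simulation of the bookkeeping only, not the Verari implementation: it opens no sockets and configures no real NAT rules, and the class name `HeadNodePortMapper`, its methods, and the example addresses are all hypothetical.

```python
class HeadNodePortMapper:
    """Bookkeeping for dynamic head-node port mappings (illustrative only)."""

    def __init__(self, public_ip, port_range):
        self.public_ip = public_ip
        self._free = list(port_range)   # public ports still available
        self._map = {}                  # public port -> (private ip, private port)

    def map_node(self, private_ip, private_port):
        """Step 1: reserve a public port that forwards to a compute node."""
        if not self._free:
            raise RuntimeError("no free ports left on the head node")
        public_port = self._free.pop(0)
        self._map[public_port] = (private_ip, private_port)
        return (self.public_ip, public_port)

    def published_endpoints(self):
        """Step 2: the mapping info that would be passed to the IMPI server."""
        return {
            private: (self.public_ip, public_port)
            for public_port, private in sorted(self._map.items())
        }

    def release_all(self):
        """Step 3: free every mapped port once the application completes."""
        self._free.extend(sorted(self._map))
        self._free.sort()
        self._map.clear()


# Example with documentation-range addresses (assumed values).
mapper = HeadNodePortMapper("203.0.113.10", range(50000, 50004))
first = mapper.map_node("10.0.0.2", 9000)
second = mapper.map_node("10.0.0.3", 9000)
endpoints = mapper.published_endpoints()
mapper.release_all()
```

Remote clients would then connect to the published head-node endpoint instead of the private compute-node address, which is what lets the compute nodes stay unpublished.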
Figure: IMPI Bandwidth, MPI/Pro; bandwidth (MB/s) versus message length (KBytes)
Performance


Configuration           Latency (us)
MPI/Pro without IMPI    142.45
MPI/Pro with IMPI       147.35
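
The cost of enabling IMPI can be read off the two latency figures directly. The short calculation below uses only the values in the table; the function name `impi_overhead` is made up for this example, and the measurement conditions (message size, network) are not stated on the slide.

```python
# Latencies in microseconds, copied from the table above.
LATENCY_US = {
    "MPI/Pro without IMPI": 142.45,
    "MPI/Pro with IMPI": 147.35,
}


def impi_overhead(latency_us):
    """Return (absolute overhead in us, relative overhead in percent)."""
    base = latency_us["MPI/Pro without IMPI"]
    with_impi = latency_us["MPI/Pro with IMPI"]
    delta = with_impi - base
    return delta, 100.0 * delta / base


delta_us, percent = impi_overhead(LATENCY_US)
print(f"IMPI adds {delta_us:.2f} us ({percent:.1f}% of the baseline latency)")
```

The difference is 4.90 us, or roughly 3.4% of the baseline latency.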
Performance


Figure: IMPI using NAT, bandwidth (streaming)
Proposed Extensions


•  IMPI  extensions  for  MPI-2 

•  Open  protocol-based  initialization  such  as  SOAP 

•  Adaptation  to  the  Grid 

•  Reduce  user  involvement 

•  Optimize  for  performance 